# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 27 12:37:42 2023"
I feel fairly comfortable working with RStudio, as it is one of the main tools I use daily. However, after browsing through "R for Health Data Science" I see that there are several ways to use R, and I tend to structure my code in a rather different fashion. Perhaps the most striking difference is the use of pipes. The example file uses pipes everywhere, whereas I tend to avoid them, as I feel they make the code harder to follow even if they shorten and simplify it. For example, if we create a random data frame with three columns
exampledata1=c(2,4,6,6,8,9,10)
exampledata2=c(70.2,44.0,62.7,106.1,81.1,96.1,10.8)
exampledata3=c(16,23,10,12,9,10,17)
df=data.frame(exampledata1,exampledata2,exampledata3)
print(df)
## exampledata1 exampledata2 exampledata3
## 1 2 70.2 16
## 2 4 44.0 23
## 3 6 62.7 10
## 4 6 106.1 12
## 5 8 81.1 9
## 6 9 96.1 10
## 7 10 10.8 17
and we want to calculate the mean of the third column, I would use the following command:
mean(df$exampledata3)
## [1] 13.85714
whereas the exercise set would use the command
df$exampledata3 %>% mean()
## [1] 13.85714
Both ways lead to the same result, even though the syntax is different.
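To make the difference concrete, here is a small self-contained sketch of the same multi-step computation in both styles. It uses base R's native |> pipe (available since R 4.1) instead of magrittr's %>%, but the idea is the same:

```r
# The same three-step computation written both ways.
x <- c(2, 4, 6, 6, 8, 9, 10)

# Nested calls read inside-out: mean first, then sqrt, then round.
nested <- round(sqrt(mean(x)), 2)

# The pipe reads left to right, one step at a time.
piped <- x |> mean() |> sqrt() |> round(2)

nested == piped   # both give the same result
```

With only one step (as in the mean example above) the pipe saves nothing, which is partly why I have not felt the need for it; with several steps it mainly saves intermediate variables.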
Another difference is plotting. The exercise set uses ggplot2, whereas I am used to base R plots. I know ggplot2 is the more versatile and widely used tool; however, I find its logic confusing compared to base plots. I hope I will learn to use ggplot2 during this course.
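The two plotting logics can be put side by side. A minimal sketch using the built-in mtcars data (chosen here only as a neutral example):

```r
library(ggplot2)

# Base graphics: one function call, options passed as arguments.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Car weight", ylab = "Miles per gallon")

# ggplot2: the plot is built in layers - data and aesthetic
# mappings first, then a geom that says how to draw them.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Car weight", y = "Miles per gallon")
p
```

The layered style feels indirect at first, but it is what makes things like the coloured-by-gender scatterplot matrix below easy to express.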
Overall I cannot say how well the exercise set worked as a crash course, as most of it was already familiar to me. However, the parts about ggplot2 were my favourite, as they helped me understand the logic behind the commands. Using Markdown is not familiar to me - I have only used R scripts earlier.
date()
## [1] "Mon Nov 27 12:37:43 2023"
The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.
learningdata=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/learning2014.csv")
head(learningdata)
## X gender Age Attitude deep stra surf Points
## 1 1 F 53 37 3.583333 3.375 2.583333 25
## 2 2 M 55 31 2.916667 2.750 3.166667 12
## 3 3 F 49 25 3.500000 3.625 2.250000 24
## 4 4 M 53 35 3.500000 3.125 2.250000 10
## 5 5 M 49 37 3.666667 3.625 2.833333 22
## 6 6 F 38 38 4.750000 3.625 2.416667 21
The data appears to be in good order. After importing the data it is possible to inspect its structure.
str(learningdata)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
There are 8 variables (X, gender, Age, Attitude, deep, stra, surf and Points), all related to learning. X is a number used to identify observations. It is irrelevant for the data exploration, as it is not a measurement but just a tool to identify different persons, so it can be excluded from the graphical investigation. A scatterplot matrix is a useful tool for understanding the relationships between variables. Gender can be treated as a dummy variable, as it is either male or female in this data. To clarify the visuals we can use a different colour for each gender.
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learningdata[-1], mapping = aes(alpha=0.3, col=gender), lower = list(combo = wrap("facethist", bins = 20)))
p
Correlations between variables can be seen in the top-right part of the graph (in total and per gender). The bottom-left part shows scatter plots, and the diagonal shows the distribution of each variable. The top row contains boxplots of each variable by gender, the left column histograms of each variable by gender, and the top-left corner the counts of each gender. There are several aspects we can observe from this one graph.
To create a model, we should select variables that have a high correlation with the explained variable (Points in this case). However, if two variables correlate strongly with the explained variable but are also highly correlated with each other, they should not both be used. In this case Points correlates with Attitude. The other attributes have quite low correlations, but stra and surf have some correlation with Points, so we add them to the model as well.
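This selection idea can be checked directly with cor(). A minimal self-contained sketch on made-up data (the names x1, x2, x3 and y are hypothetical, not from the learning data):

```r
# Toy data: y depends on x1; x2 is a noisy copy of x1 (collinear);
# x3 is unrelated noise.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)
y  <- 2 * x1 + rnorm(100)

# The y column shows which predictors correlate with the outcome;
# the x1-x2 entry shows those two are nearly redundant, so only
# one of them should enter the model.
round(cor(data.frame(x1, x2, x3, y)), 2)
```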
model = lm(Points ~ Attitude + stra +surf, data = learningdata)
summary(model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the summary table, surf has a very large p-value, so the model can be simplified.
model2 = lm(Points ~ Attitude + stra, data = learningdata)
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
If we drop the surf variable from the model, we get a simpler model with a higher adjusted R-squared and a smaller p-value. Now the instructions of this exercise advise that non-significant terms should be removed. However, the question of significance is not black and white - it is about how much uncertainty we are ready to accept. In model2 the variable stra has a p-value of 0.089. At the 95 % confidence level the term is not significant (p > 0.05), but at the 90 % level it is (p < 0.1). Luckily we have more tools to decide whether or not we wish to simplify the model further.
model3 = lm(Points ~ Attitude, data = learningdata)
summary(model3)
##
## Call:
## lm(formula = Points ~ Attitude, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
If we simplify the model even further, we start to lose some of the validity of the model, as the adjusted R-squared decreases. Therefore we might not want to have only one explanatory variable, and we are happy with model2.
We can further compare the models using the Bayesian Information Criterion (BIC).
library(flexmix)
## Warning: package 'flexmix' was built under R version 4.2.3
## Loading required package: lattice
BIC(model)
## [1] 1046.051
BIC(model2)
## [1] 1041.486
BIC(model3)
## [1] 1039.323
According to BIC (where smaller is better), the model with only Attitude as predictor is the best, despite its lower R-squared.
We can use yet more tools to compare the models. The "step" command drops predictors one by one until the smallest AIC value is reached and further simplification would only increase the AIC.
step(model)
## Start: AIC=557.4
## Points ~ Attitude + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 15.00 4559.4 555.95
## <none> 4544.4 557.40
## - stra 1 69.61 4614.0 557.93
## - Attitude 1 980.95 5525.3 587.85
##
## Step: AIC=555.95
## Points ~ Attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4559.4 555.95
## - stra 1 81.74 4641.1 556.90
## - Attitude 1 1051.89 5611.3 588.41
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Coefficients:
## (Intercept) Attitude stra
## 8.9729 0.3466 0.9137
The smallest AIC score is achieved with Attitude and stra as predictors (= model2), suggesting it is the best model. It could be argued that both model2 and model3 are valid, but I would prefer model2, so let's take a closer look at it.
summary(model2)
##
## Call:
## lm(formula = Points ~ Attitude + stra, data = learningdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## Attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
According to the summary table, the model intercept is 8.97, the parameter for Attitude is 0.35 and for stra 0.91. So the formula for the model is \(Points = 8.97 + 0.35 \cdot attitude + 0.91 \cdot stra\). This means that a one-unit increase in attitude increases the predicted points by 0.35, and a one-unit increase in stra increases them by 0.91. If both attitude and stra are 0, the model suggests that a test subject receives about 9 points.
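As a sanity check, the formula can be evaluated by hand. The coefficients below are the rounded values from the summary above; the example values Attitude = 30 and stra = 3 are made up:

```r
# Rounded coefficients from the summary of model2
intercept  <- 8.97
b_attitude <- 0.35
b_stra     <- 0.91

# Predicted exam points for a student with Attitude = 30, stra = 3
points_hat <- intercept + b_attitude * 30 + b_stra * 3
points_hat   # 8.97 + 10.5 + 2.73 = 22.2
```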
Multiple R-squared describes how much of the variation in the dependent variable is explained by the model's variables: 1 would mean all variation is explained, and 0 that none is. The multiple \(R^2\) of this model is 0.20, suggesting that 20 % of the variation is explained by the independent variables. In addition, the summary table provides an adjusted \(R^2\). This adjustment makes it easier to compare models with different numbers of predictors, as a higher number of predictors tends to increase \(R^2\) even when the additional predictors do not improve the quality of the model - each predictor explains some percentage of the variance, even by coincidence.
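The definition \(R^2 = 1 - SS_{res}/SS_{tot}\) can be verified on a toy fit (the data here are made up):

```r
# R-squared from its definition, on a toy linear fit
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)      # unexplained variation
ss_tot <- sum((y - mean(y))^2)       # total variation
r2 <- 1 - ss_res / ss_tot

all.equal(r2, summary(fit)$r.squared)   # matches summary()'s value
```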
plot(model2, which = 1)
The residuals-vs-fitted plot is used to check for non-linearity. Residuals describe the distance between an observation and the model. If the residuals have an increasing or decreasing trend, we have issues with the model, i.e. we should have used non-linear regression instead of linear. We can also detect outliers from this plot. Here the average of the residuals is close to zero throughout the graph and there are no clear patterns, so all is good. However, we can see some outliers in the data, illustrated in the lower middle part of the graph.
plot(model2, which = 2)
The normal Q-Q plot sorts observations by residuals in ascending order. The majority of observations are where one would expect them to be for normally distributed data; only the tails are a bit off from the expected values. However, as we have a limited sample size, I would not be too worried about that. As a conclusion: the residuals are, or are close to, normally distributed.
plot(model2, which = 5)
Residuals vs leverage is used to find influential observations. High leverage means that changes in that observation would affect the model more than changes in observations with low leverage. In this case there seem to be no highly influential observations, as all of them are within Cook's distance (we cannot even see the threshold at this zoom level). We can also see that the residuals do not really change as a function of leverage, indicating once again that the residuals behave normally and the variances of the observations show no patterns.
Disclaimer! I have had a super busy week and unfortunately did not have enough time to really focus this week, so I am coding this at the last minute before the deadline. If and when I have made typos and silly mistakes, please be patient - I am sorry.
date()
## [1] "Mon Nov 27 12:37:58 2023"
The first step is to read the data created during the data wrangling part. In this case it is stored on my computer.
alc=read.csv("C:/Users/03196349/Work Folders/PhD opinnot/IODS/IODS-project/data/alcdata.csv")
head(alc)
## X school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 2 GP F 17 U GT3 T 1 1 at_home other
## 3 3 GP F 15 U LE3 T 1 1 at_home other
## 4 4 GP F 15 U GT3 T 4 2 health services
## 5 5 GP F 16 U GT3 T 3 3 other other
## 6 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime schoolsup famsup activities nursery
## 1 course mother 2 2 yes no no yes
## 2 course father 1 2 no yes no no
## 3 other mother 1 2 yes no no yes
## 4 home mother 1 3 no yes yes yes
## 5 home father 1 2 no yes no yes
## 6 reputation mother 1 2 no yes yes yes
## higher internet romantic famrel freetime goout Dalc Walc health failures paid
## 1 yes no no 4 3 4 1 1 3 0 no
## 2 yes yes no 5 3 3 1 1 3 0 no
## 3 yes yes no 4 3 2 2 3 3 2 yes
## 4 yes yes yes 3 2 2 1 1 5 0 yes
## 5 yes no no 4 3 2 1 2 5 0 yes
## 6 yes yes no 5 4 2 1 2 5 0 yes
## absences G1 G2 G3 alc_use high_use
## 1 5 2 8 8 1.0 FALSE
## 2 3 7 8 8 1.0 FALSE
## 3 8 10 10 11 2.5 TRUE
## 4 1 14 14 14 1.0 FALSE
## 5 2 8 12 12 1.5 FALSE
## 6 8 14 14 14 1.5 FALSE
Let's remove the first column, "X", as it merely duplicates the row numbers.
alc=alc[,-1]
Let's take a look at our data.
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
We have 35 columns describing information about the students. Some variables describe background information. The variables G1, G2 and G3 are the grades of the test subjects. alc_use tells how large the alcohol consumption is, and high_use whether the consumption is high or not.
The four variables I would like to choose are absences, G3, Pstatus and age. I assume that high absence, low grades, a broken family and older age are linked to high consumption of alcohol.
First I would like to check how those variables relate to each other. But before that, let's load some packages!
library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.2.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
table(alc$Pstatus, alc$high_use)
##
## FALSE TRUE
## A 26 12
## T 233 99
g1 <- ggplot(alc, aes(x = high_use, y = age))
g2 <- ggplot(alc, aes(x = high_use, y = G3))
g3 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot()
g2 + geom_boxplot()
g3 + geom_boxplot()
Using the table and the boxplots, the differences between the groups appear to run in the hypothesised directions.
Therefore the hypotheses presented seem to be correct.
m <- glm(high_use ~ G3 + absences + age + Pstatus, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences + age + Pstatus, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3352 -0.8239 -0.7050 1.2037 1.8781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.52154 1.82438 -1.382 0.166931
## G3 -0.06845 0.03602 -1.900 0.057371 .
## absences 0.08199 0.02335 3.511 0.000446 ***
## age 0.12084 0.10147 1.191 0.233666
## PstatusT 0.05517 0.39732 0.139 0.889572
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 428.47 on 365 degrees of freedom
## AIC: 438.47
##
## Number of Fisher Scoring iterations: 4
According to the summary of the created model, only G3 and absences are statistically significant. Family status has little to do with alcohol consumption, so a simpler model would be better. For that we can use the "step" command, which finds the model with the smallest (= best) AIC score.
step(m)
## Start: AIC=438.47
## high_use ~ G3 + absences + age + Pstatus
##
## Df Deviance AIC
## - Pstatus 1 428.49 436.49
## - age 1 429.89 437.89
## <none> 428.47 438.47
## - G3 1 432.09 440.09
## - absences 1 442.85 450.85
##
## Step: AIC=436.49
## high_use ~ G3 + absences + age
##
## Df Deviance AIC
## - age 1 429.93 435.93
## <none> 428.49 436.49
## - G3 1 432.17 438.17
## - absences 1 443.07 449.07
##
## Step: AIC=435.93
## high_use ~ G3 + absences
##
## Df Deviance AIC
## <none> 429.93 435.93
## - G3 1 434.52 438.52
## - absences 1 445.68 449.68
##
## Call: glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = alc)
##
## Coefficients:
## (Intercept) G3 absences
## -0.38732 -0.07606 0.08423
##
## Degrees of Freedom: 369 Total (i.e. Null); 367 Residual
## Null Deviance: 452
## Residual Deviance: 429.9 AIC: 435.9
In this case such a model has just two predictors, absences and G3. So let's create that model.
m <- glm(high_use ~ G3 + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ G3 + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3286 -0.8298 -0.7219 1.2113 1.9242
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38732 0.43500 -0.890 0.373259
## G3 -0.07606 0.03553 -2.141 0.032311 *
## absences 0.08423 0.02309 3.648 0.000264 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 429.93 on 367 degrees of freedom
## AIC: 435.93
##
## Number of Fisher Scoring iterations: 4
So according to the model, the estimate for the G3 parameter is -0.08 and the parameter for absences is 0.084; the intercept is -0.39. To calculate odds ratios, we use the following command:
OR <- coef(m) %>% exp
OR
## (Intercept) G3 absences
## 0.6788751 0.9267616 1.0878762
An odds ratio describes how the odds of the outcome change when the explanatory variable increases by one unit: an odds ratio above 1 increases the odds, and one below 1 decreases them.
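Because odds ratios act multiplicatively, their effect compounds over several units. A small arithmetic check using the absences coefficient printed above:

```r
# The absences coefficient on the log-odds scale, from the summary above.
b_absences <- 0.08423

# One extra absence multiplies the odds of high use by this factor:
or_absences <- exp(b_absences)
or_absences            # about 1.088, i.e. roughly +9 % odds

# Ten extra absences compound multiplicatively:
or_absences^10         # same as exp(10 * b_absences)
```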
CI=confint(m)
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) -1.25260555 0.459482521
## G3 -0.14621889 -0.006471056
## absences 0.04110528 0.131708308
This provides 95 % confidence intervals, 97.5 % being the upper bound and 2.5 % the lower.
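Note that these bounds are on the log-odds scale; exponentiating them gives confidence intervals for the odds ratios themselves. Using the printed values for absences:

```r
# 95 % CI for the absences odds ratio, from the printed log-odds bounds
exp(c(0.04110528, 0.131708308))
# Both ends are above 1: even the cautious end of the interval
# says more absences mean higher odds of high alcohol use.
```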
Hypothesis-wise this means that age and family status do not play an important role in whether one consumes high amounts of alcohol, while high grades are linked to less alcohol and high absence to high alcohol consumption. However, it is unclear which one is the cause and which is the effect.
alc = mutate(alc, probability = predict(m, type = "response"))
alc = mutate(alc, prediction =probability>0.5)
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table()
## prediction
## high_use FALSE TRUE
## FALSE 0.68108108 0.01891892
## TRUE 0.25675676 0.04324324
The model was good at predicting when high use is false. However, it often missed actual high use: a large share of the true high users were predicted to be non-users.
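The proportion table above can be summarised into the usual classification metrics. A small self-contained check using the printed proportions:

```r
# Reading the proportion table above as a 2x2 confusion matrix:
tn <- 0.68108108   # high_use FALSE, predicted FALSE
fp <- 0.01891892   # high_use FALSE, predicted TRUE
fn <- 0.25675676   # high_use TRUE,  predicted FALSE
tp <- 0.04324324   # high_use TRUE,  predicted TRUE

accuracy    <- tn + tp            # ~0.72 of all predictions correct
sensitivity <- tp / (tp + fn)     # only ~14 % of true high users caught
specificity <- tn / (tn + fp)     # ~97 % of non-high users correct
c(accuracy, sensitivity, specificity)
```

The low sensitivity is the numeric version of the weakness noted above: the 0.5 threshold rarely flags anyone as a high user.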
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
g+geom_point()
The visual representation shows that the model often assumes high use to be false even though it actually is true.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(alc$high_use,alc$probability)
## [1] 0.2756757
The training error (the mean proportion of wrong predictions) is 0.28.
library(boot)
## Warning: package 'boot' was built under R version 4.2.3
##
## Attaching package: 'boot'
## The following object is masked from 'package:flexmix':
##
## boot
## The following object is masked from 'package:lattice':
##
## melanoma
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2810811
The average proportion of wrong predictions is 0.2811, which is greater than the error in the exercise set (0.26). This means we have a worse model than in Exercise 3. Let's try to find a better model.
m2 <- glm(high_use ~ G3 + absences+ school + sex + address+ famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup+ famsup+ activities + nursery + higher + internet + romantic + famrel + freetime + goout + health + failures + paid + absences + G1 + G2, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ G3 + absences + school + sex + address +
## famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason +
## guardian + traveltime + studytime + schoolsup + famsup +
## activities + nursery + higher + internet + romantic + famrel +
## freetime + goout + health + failures + paid + absences +
## G1 + G2, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7620 -0.6544 -0.3590 0.4831 2.8990
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.10711 1.79919 -1.727 0.084177 .
## G3 0.05976 0.11463 0.521 0.602121
## absences 0.09105 0.02706 3.365 0.000766 ***
## schoolMS -0.48058 0.53496 -0.898 0.369001
## sexM 0.84885 0.33281 2.551 0.010755 *
## addressU -0.85479 0.41943 -2.038 0.041552 *
## famsizeLE3 0.51405 0.33078 1.554 0.120176
## PstatusT -0.29228 0.53799 -0.543 0.586937
## Medu 0.04021 0.22256 0.181 0.856645
## Fedu 0.14434 0.19799 0.729 0.465991
## Mjobhealth -1.07147 0.76687 -1.397 0.162353
## Mjobother -0.85282 0.50337 -1.694 0.090222 .
## Mjobservices -0.82754 0.58036 -1.426 0.153896
## Mjobteacher -0.34813 0.70825 -0.492 0.623052
## Fjobhealth 0.33260 1.19154 0.279 0.780142
## Fjobother 0.81603 0.86311 0.945 0.344431
## Fjobservices 1.19959 0.88620 1.354 0.175853
## Fjobteacher -0.34774 1.04522 -0.333 0.739367
## reasonhome 0.55292 0.39470 1.401 0.161262
## reasonother 1.52294 0.54867 2.776 0.005508 **
## reasonreputation 0.01215 0.41787 0.029 0.976799
## guardianmother -0.83744 0.37652 -2.224 0.026136 *
## guardianother -0.13975 0.80413 -0.174 0.862026
## traveltime 0.32446 0.24022 1.351 0.176797
## studytime -0.27488 0.20833 -1.319 0.187012
## schoolsupyes 0.03707 0.47710 0.078 0.938071
## famsupyes -0.07901 0.33491 -0.236 0.813504
## activitiesyes -0.58242 0.31635 -1.841 0.065608 .
## nurseryyes -0.48277 0.36927 -1.307 0.191090
## higheryes -0.30306 0.73540 -0.412 0.680263
## internetyes 0.09333 0.45868 0.203 0.838771
## romanticyes -0.44975 0.33749 -1.333 0.182662
## famrel -0.57891 0.17076 -3.390 0.000698 ***
## freetime 0.29873 0.17003 1.757 0.078928 .
## goout 0.90284 0.15767 5.726 1.03e-08 ***
## health 0.18851 0.11495 1.640 0.101042
## failures 0.33167 0.27916 1.188 0.234790
## paidyes 0.53225 0.32442 1.641 0.100883
## G1 -0.13568 0.12864 -1.055 0.291563
## G2 0.10089 0.15969 0.632 0.527532
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 310.30 on 330 degrees of freedom
## AIC: 390.3
##
## Number of Fisher Scoring iterations: 5
step(m2)
## Start: AIC=390.3
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian +
## traveltime + studytime + schoolsup + famsup + activities +
## nursery + higher + internet + romantic + famrel + freetime +
## goout + health + failures + paid + absences + G1 + G2
##
## Df Deviance AIC
## - Mjob 4 314.58 386.58
## - schoolsup 1 310.30 388.30
## - Medu 1 310.33 388.33
## - internet 1 310.34 388.34
## - famsup 1 310.35 388.35
## - higher 1 310.47 388.47
## - G3 1 310.57 388.57
## - Pstatus 1 310.59 388.59
## - G2 1 310.70 388.70
## - Fedu 1 310.83 388.83
## - Fjob 4 317.06 389.06
## - school 1 311.12 389.12
## - G1 1 311.40 389.40
## - failures 1 311.73 389.73
## - nursery 1 311.99 389.99
## - studytime 1 312.08 390.08
## - romantic 1 312.11 390.11
## - traveltime 1 312.14 390.14
## <none> 310.30 390.30
## - famsize 1 312.70 390.70
## - paid 1 313.03 391.03
## - health 1 313.05 391.05
## - freetime 1 313.44 391.44
## - guardian 2 315.67 391.67
## - activities 1 313.75 391.75
## - address 1 314.45 392.45
## - reason 3 319.54 393.54
## - sex 1 316.96 394.96
## - absences 1 321.59 399.59
## - famrel 1 322.20 400.20
## - goout 1 350.56 428.56
##
## Step: AIC=386.58
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + Fjob + reason + guardian + traveltime +
## studytime + schoolsup + famsup + activities + nursery + higher +
## internet + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Fjob 4 320.37 384.37
## - Medu 1 314.58 384.58
## - schoolsup 1 314.59 384.59
## - internet 1 314.63 384.63
## - Pstatus 1 314.67 384.67
## - famsup 1 314.72 384.72
## - higher 1 314.82 384.82
## - G2 1 314.84 384.84
## - G3 1 314.92 384.92
## - school 1 315.37 385.37
## - G1 1 315.71 385.71
## - studytime 1 315.74 385.74
## - Fedu 1 315.77 385.77
## - failures 1 316.08 386.08
## - romantic 1 316.33 386.33
## - health 1 316.39 386.39
## - nursery 1 316.44 386.44
## - traveltime 1 316.50 386.50
## <none> 314.58 386.58
## - guardian 2 318.64 386.64
## - famsize 1 317.00 387.00
## - freetime 1 317.40 387.40
## - activities 1 317.68 387.68
## - paid 1 318.01 388.01
## - reason 3 322.94 388.94
## - address 1 319.29 389.29
## - sex 1 321.42 391.42
## - famrel 1 326.22 396.22
## - absences 1 326.30 396.30
## - goout 1 353.06 423.06
##
## Step: AIC=384.37
## high_use ~ G3 + absences + school + sex + address + famsize +
## Pstatus + Medu + Fedu + reason + guardian + traveltime +
## studytime + schoolsup + famsup + activities + nursery + higher +
## internet + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Pstatus 1 320.39 382.39
## - internet 1 320.40 382.40
## - schoolsup 1 320.41 382.41
## - Medu 1 320.44 382.44
## - higher 1 320.51 382.51
## - G3 1 320.61 382.61
## - famsup 1 320.72 382.72
## - G2 1 320.84 382.84
## - Fedu 1 320.90 382.90
## - school 1 321.08 383.08
## - studytime 1 321.52 383.52
## - traveltime 1 321.79 383.79
## - failures 1 321.83 383.83
## - nursery 1 322.04 384.04
## - G1 1 322.23 384.23
## - romantic 1 322.23 384.23
## <none> 320.37 384.37
## - health 1 322.38 384.38
## - freetime 1 322.53 384.53
## - famsize 1 322.93 384.93
## - activities 1 323.16 385.16
## - guardian 2 325.32 385.32
## - reason 3 327.69 385.69
## - paid 1 324.81 386.81
## - address 1 325.55 387.55
## - sex 1 327.22 389.22
## - famrel 1 330.47 392.47
## - absences 1 332.72 394.72
## - goout 1 360.29 422.29
##
## Step: AIC=382.39
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## schoolsup + famsup + activities + nursery + higher + internet +
## romantic + famrel + freetime + goout + health + failures +
## paid + G1 + G2
##
## Df Deviance AIC
## - internet 1 320.42 380.42
## - schoolsup 1 320.43 380.43
## - Medu 1 320.45 380.45
## - higher 1 320.53 380.53
## - G3 1 320.65 380.65
## - famsup 1 320.75 380.75
## - G2 1 320.85 380.85
## - Fedu 1 320.93 380.93
## - school 1 321.12 381.12
## - studytime 1 321.53 381.53
## - traveltime 1 321.81 381.81
## - failures 1 321.86 381.86
## - nursery 1 322.05 382.05
## - romantic 1 322.25 382.25
## - G1 1 322.29 382.29
## - health 1 322.39 382.39
## <none> 320.39 382.39
## - freetime 1 322.53 382.53
## - famsize 1 323.01 383.01
## - activities 1 323.26 383.26
## - guardian 2 325.33 383.33
## - reason 3 327.71 383.71
## - paid 1 324.81 384.81
## - address 1 325.57 385.57
## - sex 1 327.24 387.24
## - famrel 1 330.51 390.51
## - absences 1 332.91 392.91
## - goout 1 360.30 420.30
##
## Step: AIC=380.42
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## schoolsup + famsup + activities + nursery + higher + romantic +
## famrel + freetime + goout + health + failures + paid + G1 +
## G2
##
## Df Deviance AIC
## - schoolsup 1 320.45 378.45
## - Medu 1 320.47 378.47
## - higher 1 320.57 378.57
## - G3 1 320.67 378.67
## - famsup 1 320.77 378.77
## - G2 1 320.87 378.87
## - Fedu 1 320.95 378.95
## - school 1 321.13 379.13
## - studytime 1 321.55 379.55
## - traveltime 1 321.83 379.83
## - failures 1 321.87 379.87
## - nursery 1 322.13 380.13
## - romantic 1 322.25 380.25
## - G1 1 322.30 380.30
## <none> 320.42 380.42
## - health 1 322.42 380.42
## - freetime 1 322.59 380.59
## - famsize 1 323.06 381.06
## - activities 1 323.26 381.26
## - guardian 2 325.37 381.37
## - reason 3 327.73 381.73
## - paid 1 324.93 382.93
## - address 1 325.67 383.67
## - sex 1 327.31 385.31
## - famrel 1 330.52 388.52
## - absences 1 333.27 391.27
## - goout 1 360.46 418.46
##
## Step: AIC=378.45
## high_use ~ G3 + absences + school + sex + address + famsize +
## Medu + Fedu + reason + guardian + traveltime + studytime +
## famsup + activities + nursery + higher + romantic + famrel +
## freetime + goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - Medu 1 320.50 376.50
## - higher 1 320.61 376.61
## - G3 1 320.69 376.69
## - famsup 1 320.80 376.80
## - G2 1 320.89 376.89
## - Fedu 1 320.97 376.97
## - school 1 321.13 377.13
## - studytime 1 321.60 377.60
## - traveltime 1 321.84 377.84
## - failures 1 321.90 377.90
## - nursery 1 322.16 378.16
## - romantic 1 322.27 378.27
## - G1 1 322.32 378.32
## <none> 320.45 378.45
## - health 1 322.49 378.49
## - freetime 1 322.61 378.61
## - famsize 1 323.10 379.10
## - activities 1 323.35 379.35
## - guardian 2 325.38 379.38
## - reason 3 327.75 379.75
## - paid 1 325.06 381.06
## - address 1 325.79 381.79
## - sex 1 327.67 383.67
## - famrel 1 330.54 386.54
## - absences 1 333.37 389.37
## - goout 1 360.66 416.66
##
## Step: AIC=376.5
## high_use ~ G3 + absences + school + sex + address + famsize +
## Fedu + reason + guardian + traveltime + studytime + famsup +
## activities + nursery + higher + romantic + famrel + freetime +
## goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - higher 1 320.66 374.66
## - G3 1 320.75 374.75
## - famsup 1 320.87 374.87
## - G2 1 320.93 374.93
## - Fedu 1 321.03 375.03
## - school 1 321.18 375.18
## - studytime 1 321.65 375.65
## - traveltime 1 321.91 375.91
## - failures 1 322.01 376.01
## - nursery 1 322.28 376.28
## - romantic 1 322.35 376.35
## - G1 1 322.37 376.37
## <none> 320.50 376.50
## - health 1 322.54 376.54
## - freetime 1 322.65 376.65
## - famsize 1 323.18 377.18
## - activities 1 323.42 377.42
## - guardian 2 325.60 377.60
## - reason 3 327.81 377.81
## - paid 1 325.07 379.07
## - address 1 325.93 379.93
## - sex 1 327.67 381.67
## - famrel 1 330.59 384.59
## - absences 1 333.40 387.40
## - goout 1 360.66 414.66
##
## Step: AIC=374.66
## high_use ~ G3 + absences + school + sex + address + famsize +
## Fedu + reason + guardian + traveltime + studytime + famsup +
## activities + nursery + romantic + famrel + freetime + goout +
## health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - G3 1 320.88 372.88
## - famsup 1 321.06 373.06
## - G2 1 321.10 373.10
## - Fedu 1 321.14 373.14
## - school 1 321.38 373.38
## - studytime 1 321.86 373.86
## - traveltime 1 322.07 374.07
## - failures 1 322.33 374.33
## - romantic 1 322.38 374.38
## - nursery 1 322.40 374.40
## - G1 1 322.55 374.55
## <none> 320.66 374.66
## - health 1 322.67 374.67
## - freetime 1 322.79 374.79
## - famsize 1 323.32 375.32
## - activities 1 323.79 375.79
## - guardian 2 325.86 375.86
## - reason 3 327.93 375.93
## - paid 1 325.11 377.11
## - address 1 326.17 378.17
## - sex 1 328.20 380.20
## - famrel 1 330.73 382.73
## - absences 1 333.75 385.75
## - goout 1 360.88 412.88
##
## Step: AIC=372.88
## high_use ~ absences + school + sex + address + famsize + Fedu +
## reason + guardian + traveltime + studytime + famsup + activities +
## nursery + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - famsup 1 321.23 371.23
## - Fedu 1 321.32 371.32
## - school 1 321.66 371.66
## - studytime 1 322.19 372.19
## - traveltime 1 322.34 372.34
## - failures 1 322.42 372.42
## - G2 1 322.54 372.54
## - romantic 1 322.56 372.56
## - G1 1 322.60 372.60
## - nursery 1 322.63 372.63
## <none> 320.88 372.88
## - health 1 322.91 372.91
## - freetime 1 323.01 373.01
## - famsize 1 323.45 373.45
## - guardian 2 325.96 373.96
## - activities 1 323.98 373.98
## - reason 3 327.99 373.99
## - paid 1 325.33 375.33
## - address 1 326.35 376.35
## - sex 1 328.32 378.32
## - famrel 1 330.73 380.73
## - absences 1 334.31 384.31
## - goout 1 361.17 411.17
##
## Step: AIC=371.23
## high_use ~ absences + school + sex + address + famsize + Fedu +
## reason + guardian + traveltime + studytime + activities +
## nursery + romantic + famrel + freetime + goout + health +
## failures + paid + G1 + G2
##
## Df Deviance AIC
## - Fedu 1 321.57 369.57
## - school 1 321.87 369.87
## - traveltime 1 322.66 370.66
## - studytime 1 322.74 370.74
## - failures 1 322.78 370.78
## - G2 1 322.87 370.87
## - G1 1 322.87 370.87
## - nursery 1 322.96 370.96
## - romantic 1 322.98 370.98
## - health 1 323.12 371.12
## <none> 321.23 371.23
## - freetime 1 323.23 371.23
## - famsize 1 323.87 371.87
## - guardian 2 326.22 372.22
## - activities 1 324.31 372.31
## - reason 3 328.42 372.42
## - paid 1 325.35 373.35
## - address 1 326.60 374.60
## - sex 1 329.13 377.13
## - famrel 1 330.86 378.86
## - absences 1 334.61 382.61
## - goout 1 361.79 409.79
##
## Step: AIC=369.57
## high_use ~ absences + school + sex + address + famsize + reason +
## guardian + traveltime + studytime + activities + nursery +
## romantic + famrel + freetime + goout + health + failures +
## paid + G1 + G2
##
## Df Deviance AIC
## - school 1 322.18 368.18
## - traveltime 1 322.86 368.86
## - failures 1 322.87 368.87
## - nursery 1 323.07 369.07
## - G1 1 323.12 369.12
## - studytime 1 323.23 369.23
## - G2 1 323.23 369.23
## - romantic 1 323.28 369.28
## - freetime 1 323.44 369.44
## <none> 321.57 369.57
## - health 1 323.58 369.58
## - famsize 1 324.02 370.02
## - activities 1 324.48 370.48
## - reason 3 328.70 370.70
## - guardian 2 327.05 371.05
## - paid 1 325.82 371.82
## - address 1 326.98 372.98
## - sex 1 329.62 375.62
## - famrel 1 331.29 377.29
## - absences 1 335.46 381.46
## - goout 1 363.61 409.61
##
## Step: AIC=368.18
## high_use ~ absences + sex + address + famsize + reason + guardian +
## traveltime + studytime + activities + nursery + romantic +
## famrel + freetime + goout + health + failures + paid + G1 +
## G2
##
## Df Deviance AIC
## - traveltime 1 323.22 367.22
## - G1 1 323.54 367.54
## - nursery 1 323.57 367.57
## - failures 1 323.64 367.64
## - G2 1 323.71 367.71
## - studytime 1 323.76 367.76
## - freetime 1 323.86 367.86
## - romantic 1 324.04 368.04
## <none> 322.18 368.18
## - health 1 324.53 368.53
## - famsize 1 324.55 368.55
## - reason 3 328.79 368.79
## - activities 1 324.93 368.93
## - guardian 2 327.39 369.39
## - paid 1 326.30 370.30
## - address 1 326.99 370.99
## - sex 1 330.30 374.30
## - famrel 1 331.74 375.74
## - absences 1 336.59 380.59
## - goout 1 364.02 408.02
##
## Step: AIC=367.22
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid + G1 + G2
##
## Df Deviance AIC
## - G1 1 324.57 366.57
## - G2 1 324.63 366.63
## - nursery 1 324.69 366.69
## - freetime 1 324.75 366.75
## - failures 1 324.81 366.81
## - studytime 1 324.99 366.99
## - romantic 1 325.10 367.10
## <none> 323.22 367.22
## - reason 3 329.53 367.53
## - health 1 325.57 367.57
## - famsize 1 325.97 367.97
## - activities 1 326.01 368.01
## - guardian 2 329.24 369.24
## - paid 1 327.40 369.40
## - address 1 330.64 372.64
## - sex 1 331.41 373.41
## - famrel 1 332.55 374.55
## - absences 1 337.62 379.62
## - goout 1 366.10 408.10
##
## Step: AIC=366.57
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid + G2
##
## Df Deviance AIC
## - G2 1 324.70 364.70
## - freetime 1 325.93 365.93
## - nursery 1 326.11 366.11
## - failures 1 326.24 366.24
## - studytime 1 326.54 366.54
## <none> 324.57 366.57
## - health 1 327.05 367.05
## - romantic 1 327.05 367.05
## - famsize 1 327.24 367.24
## - reason 3 331.26 367.26
## - activities 1 327.40 367.40
## - guardian 2 330.30 368.30
## - paid 1 329.01 369.01
## - address 1 331.70 371.70
## - sex 1 332.66 372.66
## - famrel 1 334.08 374.08
## - absences 1 338.85 378.85
## - goout 1 366.53 406.53
##
## Step: AIC=364.7
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + freetime +
## goout + health + failures + paid
##
## Df Deviance AIC
## - freetime 1 326.08 364.08
## - nursery 1 326.21 364.21
## - failures 1 326.24 364.24
## - studytime 1 326.57 364.57
## <none> 324.70 364.70
## - health 1 327.09 365.09
## - romantic 1 327.32 365.32
## - activities 1 327.43 365.43
## - reason 3 331.46 365.46
## - famsize 1 327.50 365.50
## - guardian 2 330.37 366.37
## - paid 1 329.18 367.18
## - address 1 331.74 369.74
## - sex 1 332.86 370.86
## - famrel 1 334.27 372.27
## - absences 1 338.85 376.85
## - goout 1 367.17 405.17
##
## Step: AIC=364.08
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + nursery + romantic + famrel + goout +
## health + failures + paid
##
## Df Deviance AIC
## - nursery 1 327.51 363.51
## - failures 1 327.64 363.64
## - studytime 1 328.04 364.04
## <none> 326.08 364.08
## - health 1 328.44 364.44
## - romantic 1 328.46 364.46
## - activities 1 328.54 364.54
## - reason 3 332.61 364.61
## - famsize 1 328.73 364.73
## - guardian 2 332.13 366.13
## - paid 1 330.49 366.49
## - address 1 332.70 368.70
## - famrel 1 335.05 371.05
## - sex 1 335.40 371.40
## - absences 1 339.96 375.96
## - goout 1 375.82 411.82
##
## Step: AIC=363.51
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + romantic + famrel + goout + health +
## failures + paid
##
## Df Deviance AIC
## - failures 1 329.01 363.01
## <none> 327.51 363.51
## - famsize 1 329.70 363.70
## - romantic 1 329.86 363.86
## - studytime 1 329.87 363.87
## - activities 1 329.88 363.88
## - reason 3 333.89 363.89
## - health 1 329.95 363.95
## - paid 1 331.64 365.64
## - guardian 2 334.22 366.22
## - address 1 334.73 368.73
## - famrel 1 336.36 370.36
## - sex 1 336.42 370.42
## - absences 1 340.99 374.99
## - goout 1 376.85 410.85
##
## Step: AIC=363.01
## high_use ~ absences + sex + address + famsize + reason + guardian +
## studytime + activities + romantic + famrel + goout + health +
## paid
##
## Df Deviance AIC
## <none> 329.01 363.01
## - famsize 1 331.20 363.20
## - romantic 1 331.25 363.25
## - reason 3 335.60 363.60
## - activities 1 331.61 363.61
## - health 1 331.97 363.97
## - studytime 1 332.12 364.12
## - paid 1 332.59 364.59
## - guardian 2 335.95 365.95
## - address 1 336.70 368.70
## - sex 1 338.10 370.10
## - famrel 1 338.76 370.76
## - absences 1 343.00 375.00
## - goout 1 382.34 414.34
##
## Call: glm(formula = high_use ~ absences + sex + address + famsize +
## reason + guardian + studytime + activities + romantic + famrel +
## goout + health + paid, family = "binomial", data = alc)
##
## Coefficients:
## (Intercept) absences sexM addressU
## -1.72681 0.09174 0.91034 -0.94771
## famsizeLE3 reasonhome reasonother reasonreputation
## 0.45183 0.20790 1.02210 -0.27714
## guardianmother guardianother studytime activitiesyes
## -0.75088 0.40855 -0.32528 -0.46361
## romanticyes famrel goout health
## -0.45457 -0.48800 0.90858 0.17681
## paidyes
## 0.54838
##
## Degrees of Freedom: 369 Total (i.e. Null); 353 Residual
## Null Deviance: 452
## Residual Deviance: 329 AIC: 363
m3 <- glm(high_use ~ absences + sex + address + famsize +
reason + guardian + studytime + activities + romantic + famrel +
goout + health + paid, data = alc, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ absences + sex + address + famsize +
## reason + guardian + studytime + activities + romantic + famrel +
## goout + health + paid, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8674 -0.7171 -0.4029 0.5821 2.7665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.72681 0.94633 -1.825 0.068041 .
## absences 0.09174 0.02432 3.773 0.000161 ***
## sexM 0.91034 0.30568 2.978 0.002901 **
## addressU -0.94771 0.34320 -2.761 0.005756 **
## famsizeLE3 0.45183 0.30426 1.485 0.137536
## reasonhome 0.20790 0.36585 0.568 0.569858
## reasonother 1.02210 0.48846 2.093 0.036393 *
## reasonreputation -0.27714 0.37752 -0.734 0.462877
## guardianmother -0.75088 0.33290 -2.256 0.024096 *
## guardianother 0.40855 0.73349 0.557 0.577533
## studytime -0.32528 0.18830 -1.727 0.084081 .
## activitiesyes -0.46361 0.28927 -1.603 0.109002
## romanticyes -0.45457 0.30736 -1.479 0.139160
## famrel -0.48800 0.15813 -3.086 0.002028 **
## goout 0.90858 0.13835 6.567 5.13e-11 ***
## health 0.17681 0.10399 1.700 0.089096 .
## paidyes 0.54838 0.29252 1.875 0.060834 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 329.01 on 353 degrees of freedom
## AIC: 363.01
##
## Number of Fisher Scoring iterations: 5
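The coefficients in the summary above are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch using the built-in mtcars data (not the alc dataset):

```r
# Sketch: exponentiating logistic-regression coefficients gives odds ratios.
# Uses the built-in mtcars data, not the alc dataset.
fit <- glm(am ~ wt, data = mtcars, family = "binomial")
exp(coef(fit))  # odds ratios; values below 1 decrease the odds of am = 1
```

Here, for example, `exp(coef(fit))["wt"]` is below 1, meaning heavier cars have lower odds of a manual transmission.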
alc <- mutate(alc, probability = predict(m3, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(alc$high_use, alc$probability)
## [1] 0.2756757
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m3, K = 10)
cv$delta[1]
## [1] 0.2810811
I tried to create as good a model as possible by introducing all variables into the model and using the step() function to find the model with the smallest AIC, ending up with model m3. However, its training error was about 0.276 and its 10-fold cross-validation error about 0.281, only slightly smaller than for the original model. In addition, I am coding this late at night as the deadline is looming, so I must give up and declare that I cannot find a better model than the one in Exercise 3.
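As a sanity check, the loss function used here simply counts the share of observations that end up on the wrong side of the 0.5 cutoff. A toy example with made-up values (not the alc data):

```r
# Toy check of the loss function: share of predictions on the wrong
# side of the 0.5 cutoff (made-up values, not the alc data).
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.3, 0.2, 0.6))  # 2 of 4 wrong -> 0.5
```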
date()
## [1] "Mon Nov 27 12:38:07 2023"
Let’s import the data. We will use the “Boston” dataset from the MASS package.
library(MASS)
## Warning: package 'MASS' was built under R version 4.2.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
data("Boston")
Now we can explore how the data looks
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Dataset “Boston” has 506 observations and 14 variables. Most of the variables are numeric, but two (chas and rad) are integers. The variables of the data and their descriptions are presented below.
| Variable | Description |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (1 if tract bounds the river, 0 otherwise) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | proportion of blacks by town (transformed) |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
Summary of the data
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here you can see the pairwise comparison of each variable. The matrix is fairly small in size but provides some visual insights. One can see that some variables have little relationship (such as age and ptratio), whereas some have a strong relationship (nox and dis). A correlation matrix could be easier to interpret.
## Warning: package 'corrplot' was built under R version 4.2.3
## corrplot 0.92 loaded
## Scaling the data
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling, the mean of each variable is 0 and the standard deviation is 1. So instead of an absolute value, each value tells how far (in standard deviations) the observation lies from the mean of that variable.
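As a quick check, scale() does the same as subtracting the column mean and dividing by the column standard deviation:

```r
library(MASS)  # for the Boston data
boston_scaled <- as.data.frame(scale(Boston))
# scale() subtracts the column mean and divides by the column sd:
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, boston_scaled$crim)  # TRUE
```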
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
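The cut() call with quantile breaks puts roughly a quarter of the observations into each class. A toy illustration with made-up data:

```r
# cut() with quantile breaks splits the data into four equally sized classes
x <- 1:100
b <- quantile(x)
f <- cut(x, breaks = b, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)  # 25 observations in each class
```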
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = myscale * heads[, choices[1]],
                   y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
Here we can see which variables separate the data into different groups. It seems that rad has the greatest effect on LD1, whereas nox has the greatest effect on LD2. If I interpret the graph correctly, access to radial highways is linked to a high crime rate and nitrogen oxide concentration to medium-high rates.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 9 3 0
## med_low 7 15 8 0
## med_high 1 4 9 3
## high 0 0 0 28
In most cases the model made correct predictions. Especially the success rate in predicting a high crime rate was good, while medium-low crime rates seem to be the most difficult to predict. However, the predictions are not perfect: for example, in three cases the model predicted a medium-high crime rate even though the actual rate was low.
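The overall accuracy can be read off the cross-tabulation as the share of the diagonal. The counts below are copied from the table above (they come from one random train/test split, so they may differ on a re-run):

```r
# Accuracy = correct predictions (diagonal) / all predictions.
# Counts copied from the confusion table above.
conf <- matrix(c(15,  9, 3,  0,
                  7, 15, 8,  0,
                  1,  4, 9,  3,
                  0,  0, 0, 28), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # about 0.66
```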
data(Boston)
scal_bos <- as.data.frame(scale(Boston))
dist_eu <- dist(scal_bos)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
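dist() computes pairwise Euclidean distances; for two points it reduces to the familiar square root of summed squared differences. A toy check:

```r
# Euclidean distance between two points, by dist() and by hand
a <- c(1, 2, 3)
b <- c(4, 6, 3)
as.numeric(dist(rbind(a, b)))  # 5
sqrt(sum((a - b)^2))           # 5
```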
## K-means
Let’s calculate k-means
km <- kmeans(scal_bos, centers = 5)
summary(km)
## Length Class Mode
## cluster 506 -none- numeric
## centers 70 -none- numeric
## totss 1 -none- numeric
## withinss 5 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 5 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
However, the clustering above would be better if we found the optimal number of clusters. So let us do that!
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(scal_bos, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
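Since qplot() is deprecated, the same elbow plot can also be drawn with base graphics, which I am more used to. A self-contained sketch (the set.seed() call is my own addition for reproducibility, since k-means starts are random):

```r
library(MASS)  # Boston data
scal_bos <- as.data.frame(scale(Boston))
set.seed(123)  # k-means starts are random; fix the seed for reproducibility
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(scal_bos, k)$tot.withinss)
# base-graphics version of the elbow plot drawn with qplot() above
plot(1:k_max, twcss, type = "l",
     xlab = "number of clusters", ylab = "total within-cluster sum of squares")
```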
km <- kmeans(scal_bos, centers = 2)
pairs(scal_bos, col = km$cluster)
It is hard to see what is going on, so let’s take a closer look at some interesting-looking variables.
pairs(scal_bos[c(1,10)], col = km$cluster)
pairs(scal_bos[c(1,7)], col = km$cluster)
For example, in these two plots one can see clear clustering with crime and tax: a high tax rate and a high crime rate seem to form two different clusters. This suggests that there is less crime in areas with a high tax rate. It also seems that the crime rate increases with the proportion of owner-occupied units built prior to 1940.
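This interpretation can also be checked numerically by comparing the cluster means of the variables of interest (a sketch with two clusters as above; the seed is my own addition):

```r
library(MASS)
scal_bos <- as.data.frame(scale(Boston))
set.seed(123)  # k-means starts are random
km <- kmeans(scal_bos, centers = 2)
# mean of each (scaled) variable within the two clusters
round(aggregate(scal_bos[, c("crim", "tax", "age")],
                by = list(cluster = km$cluster), FUN = mean), 2)
```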
km <- kmeans(scal_bos, centers = 4)
lda.fit <- lda(km$cluster ~ ., data = scal_bos)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = myscale * heads[, choices[1]],
                   y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 2.5)
On LD2 the most influential variable is black. On LD1 it is hard to determine, but it might be nox.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
model_predictors <- dplyr::select(train, -crime)
lda.fit <- lda(crime ~ ., data = train)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
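The matrix multiplication above projects the 404 x 13 predictor matrix onto the three linear discriminants; in general, an (n x p) matrix times a (p x k) matrix yields an (n x k) matrix. A toy check of the dimensions:

```r
# Matrix multiplication conforms: (n x p) %*% (p x k) gives an n x k matrix
A <- matrix(1:6, nrow = 2)    # 2 x 3
B <- matrix(1:12, nrow = 3)   # 3 x 4
dim(A %*% B)                  # 2 4
```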